Abstract

Data mining techniques, including clustering and classification tasks, for
automatic information extraction from large datasets are increasingly in
demand in several scientific fields. In particular, in the astrophysical field,
large archives and digital sky surveys with sizes of the order of $10^{12}$ bytes
currently exist, while in the near future they will reach sizes of the order of
$10^{15}$ bytes.
In this work we propose a multidimensional indexing method to efficiently 
query and mine large astrophysical datasets. A novelty detection algorithm, 
based on the Support Vector Clustering and using density and neighborhood 
information stored in the index structure, is proposed to find regions of 
interest in data characterized by isotropic noise. We show an application of 
this method for the detection of point sources from a gamma-ray photon list. 


1 Characterization of the astrophysical datasets 

At present, several projects for the multi-wavelength observation of the Universe
are underway, for example SDSS, GALEX, POSS2, and DENIS. In the coming years,
new space missions will be launched (e.g. GLAST, Swift), surveying the whole sky
at different wavelengths (gamma-ray, X-ray, optical). In the astroparticle and
astrophysical fields, data are mostly characterized by multidimensional arrays. For
instance, in X-ray and gamma-ray astronomy, the data gathered by detectors
are lists of detected photons whose properties include position (RA, DEC), arrival
time, energy, error measures for both the position and the energy estimates, and
quality measures of the events. Source catalogs, produced by the analysis of the
raw data, are lists of point and extended sources characterized by coordinates,
magnitude, spectral indexes, flux, etc.

1.1 MINING MULTIDIMENSIONAL DATA 

Data mining applied to multidimensional data analyzes the relationships between
the attributes of a multidimensional object stored in the database and the
attributes of its neighbors. Several kinds of queries are required by this
analysis:

• point queries, to find all objects overlapping the query point;

• range queries, to find all objects having at least one common point with a
query window;

• nearest-neighbor queries, to find all objects that have a minimum distance
from the query object.
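
The three query types can be illustrated with a minimal sketch. It assumes that objects are axis-aligned boxes stored as (lower corner, upper corner) tuples and that they are scanned linearly; the names `contains`, `intersects`, `min_dist`, `point_query`, `range_query` and `nearest_neighbors` are illustrative and do not come from any library.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Point = Sequence[float]
Box = Tuple[Point, Point]  # (lower corner, upper corner); assumed layout


def contains(box: Box, p: Point) -> bool:
    """True if point p lies inside or on the border of box."""
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, p, hi))


def intersects(a: Box, b: Box) -> bool:
    """True if boxes a and b share at least one point."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def min_dist(box: Box, p: Point) -> float:
    """Smallest Euclidean distance between p and any point of box."""
    lo, hi = box
    return sqrt(sum(max(l - x, 0.0, x - h) ** 2
                    for l, x, h in zip(lo, p, hi)))


def point_query(objects: List[Box], p: Point) -> List[Box]:
    """All objects overlapping the query point."""
    return [o for o in objects if contains(o, p)]


def range_query(objects: List[Box], window: Box) -> List[Box]:
    """All objects sharing at least one point with the query window."""
    return [o for o in objects if intersects(o, window)]


def nearest_neighbors(objects: List[Box], p: Point) -> List[Box]:
    """All objects at minimum distance from p (ties included)."""
    if not objects:
        return []
    dists = [min_dist(o, p) for o in objects]
    best = min(dists)
    return [o for o, d in zip(objects, dists) if d == best]
```

Each function here visits every object, so its cost grows linearly with the dataset; a spatial index is meant to avoid exactly this full scan.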

Another important operation is the spatial join, which in the astrophysical field is
needed to search multiple source catalogs and cross-identify sources from different
wavebands. Such multidimensional (spatial) data tend to be large (sky maps can
reach sizes of terabytes), requiring the use of secondary storage, and
there is no total ordering on spatial objects that preserves spatial proximity. This
characteristic makes it difficult to use traditional indexing methods, like B+-trees
or linear hashing.
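
As an illustration of the spatial join used for cross-identification, the following sketch matches two catalogs by angular separation with a plain nested loop. It assumes that each catalog is a list of (RA, DEC) pairs in degrees; the function names and the `radius_deg` parameter are hypothetical, and the separation uses the standard haversine formula.

```python
from math import asin, cos, degrees, radians, sin, sqrt


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    h = (sin((dec2 - dec1) / 2) ** 2
         + cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return degrees(2 * asin(sqrt(min(1.0, h))))


def cross_match(cat_a, cat_b, radius_deg):
    """Index pairs (i, j) whose separation is at most radius_deg.

    Nested-loop join: every source of cat_a is compared with every
    source of cat_b.
    """
    return [(i, j)
            for i, (ra_a, dec_a) in enumerate(cat_a)
            for j, (ra_b, dec_b) in enumerate(cat_b)
            if angular_separation(ra_a, dec_a, ra_b, dec_b) <= radius_deg]
```

The nested loop costs one comparison per pair of sources; with a multidimensional index, the inner loop could be replaced by a range query around each source.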


2 An optimized R-tree structure 

The R-tree is a data structure meant to efficiently index multidimensional point 
data or objects with a spatial extent. The structure of an R-tree is the following: 

• an inner node of the R-tree has entries of the form (cp, MBB), where 
cp is the address of a child node and MBB is the n-dimensional Minimum 
Bounding Box of all entries in that child node; 

• a leaf node has entries of the form (cp, MBB), where cp refers to a record
describing a particular object and MBB is its minimum bounding box, or
(Point, Attributes), where Point is a coordinate in the n-dimensional space
and Attributes are data associated to that point.
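
The node layout described above can be sketched as follows. This is an in-memory illustration only: the class names `LeafNode` and `InnerNode`, the dictionary used for Attributes, and the helper functions are assumptions, and the child reference cp is represented by a direct Python reference rather than a disk address.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner)


@dataclass
class LeafNode:
    # Entries of the form (Point, Attributes).
    entries: List[Tuple[Point, dict]] = field(default_factory=list)


@dataclass
class InnerNode:
    # Entries of the form (cp, MBB): a child node and its bounding box.
    entries: List[Tuple["Node", Box]] = field(default_factory=list)


Node = Union[LeafNode, InnerNode]


def mbb_of_points(points: List[Point]) -> Box:
    """Minimum bounding box of a non-empty list of points."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one point."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def range_search(node: Node, window: Box) -> List[Tuple[Point, dict]]:
    """Collect (point, attributes) pairs inside window, pruning by MBB."""
    lo, hi = window
    if isinstance(node, LeafNode):
        return [(p, attrs) for p, attrs in node.entries
                if all(l <= x <= h for l, x, h in zip(lo, p, hi))]
    found = []
    for child, mbb in node.entries:
        if overlaps(mbb, window):  # skip subtrees that cannot match
            found.extend(range_search(child, window))
    return found
```

The pruning test in `range_search` is what makes the MBB stored next to each child pointer useful: a subtree is visited only when its bounding box intersects the query window.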

Figure 1. The structure of the optimized R-tree built on a photon dataset

An optimized index, in terms of construction time, memory occupation and query
performance, can be built using a priori information on the dataset by means
of bulk loading algorithms. We have followed a top-down construction method
called the VAMSplit algorithm to build an optimized R-tree. This method preserves
the spatial proximity between sibling nodes, resulting in a partition of the dataset
with no overlap between MBBs. Moreover, the volume of the data space
covered by each node (at a particular level) is variable and depends on the data
density. The main idea of this method is to recursively split the dataset on a near
median element along the dimension with maximum variance. In particular, at
each recursive step, the capacity of the child subtrees is calculated by:

$$\mathrm{cscap} = B \cdot F^{\lceil \log_F (N/B) \rceil - 1}$$

where $B$ is the page capacity, $F$ is the internal fanout and $N$ is the number of
elements to index in the current step. The near median element is computed by:

$$\mathrm{med} = \mathrm{cscap} \cdot \left\lfloor \frac{\lceil N / \mathrm{cscap} \rceil}{2} \right\rfloor$$

Our implementation uses a sampling strategy to find a good pivot value in the
partition step and reduce the number of I/O operations; a caching strategy has been
adopted to partition the data in secondary storage. The total construction
time is $O\left(\frac{N}{B} \log_M \frac{N}{B}\right)$.
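
The recursive split can be sketched in memory as follows. This is a simplified illustration under stated assumptions, not the implementation described above: it takes the subtree capacity as $B \cdot F^{\lceil \log_F (N/B) \rceil - 1}$ and the near median as $\mathrm{cscap} \cdot \lfloor \lceil N/\mathrm{cscap} \rceil / 2 \rfloor$, it sorts the points completely instead of using sampling and caching, and it returns only the leaf pages. The function names are hypothetical.

```python
from statistics import pvariance


def subtree_capacity(n, page_cap, fanout):
    """B * F**(ceil(log_F(N / B)) - 1), computed with integers only.

    Requires n > page_cap and fanout >= 2.
    """
    if n <= page_cap:
        raise ValueError("a single page already holds all elements")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    cap, height = page_cap, 0
    while cap < n:  # smallest height with B * F**height >= N
        cap *= fanout
        height += 1
    return page_cap * fanout ** (height - 1)


def near_median(n, cscap):
    """cscap * floor(ceil(N / cscap) / 2): a multiple of cscap near N / 2."""
    return cscap * ((-(-n // cscap)) // 2)


def vam_split(points, page_cap, fanout):
    """Recursively partition points into leaf pages of at most page_cap."""
    n = len(points)
    if n <= page_cap:
        return [points]
    # Split along the dimension with maximum variance.
    axis = max(range(len(points[0])),
               key=lambda d: pvariance([p[d] for p in points]))
    ordered = sorted(points, key=lambda p: p[axis])
    med = near_median(n, subtree_capacity(n, page_cap, fanout))
    return (vam_split(ordered[:med], page_cap, fanout)
            + vam_split(ordered[med:], page_cap, fanout))
```

Because the split position is always a multiple of the subtree capacity, the left part fills its subtrees completely, which keeps page utilization high; for example, 100 points with a page capacity of 10 and a fanout of 4 end up in 10 full pages.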


3 A scalable novelty detection algorithm 

The structure of the optimized R-tree can help explore the data and find
regions of interest. For this purpose, further information can be added to each node:
the total number of data points covered by the node, their mean and variance, and
other statistical moments. In this work we propose a novelty detection algorithm
based on Support Vector Clustering (SVC).

3.1 THE SVC ALGORITHM

The SVC algorithm estimates the support of a high dimensional distribution. It 
computes the hypersphere with minimal radius which encloses the data points 
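when mapped to a high dimensional feature space. Given a set of points
$x_1, \ldots, x_N$, with $x_i \in X \subseteq \mathbb{R}^d$, it finds the hypersphere
$(c, r)$ that solves the optimization problem:

$$\min_{c, r, \rho} \; r^2 + C \sum_{i=1}^{N} \rho_i$$

$$\text{s.t.} \quad \|\phi(x_i) - c\|^2 \le r^2 + \rho_i, \qquad \rho_i \ge 0, \quad i = 1, \ldots, N$$

where $\phi$ is the map to the feature space, $\rho_i$ are slack variables and $C$ is
a positive constant. Defining the Lagrangian, the solution is obtained by solving
the dual problem:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i k(x_i, x_i) - \sum_{i,j=1}^{N} \alpha_i \alpha_j k(x_i, x_j)$$

$$\text{s.t.} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, N, \qquad \sum_{i=1}^{N} \alpha_i = 1$$

where $\alpha_i$ are the Lagrange multipliers and $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.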
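
A minimal sketch of how the dual problem could be solved numerically is given below. It is not the solver used in this work: it assumes a Gaussian kernel with a hypothetical width parameter `q`, the standard normalization in which the multipliers sum to one, and a plain Frank-Wolfe iteration chosen only because it needs no external library. The result is approximate and all function names are illustrative.

```python
from math import exp


def gaussian_kernel(x, y, q=1.0):
    """k(x, y) = exp(-q * ||x - y||^2); q is an assumed width parameter."""
    return exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))


def capped_vertex(scores, C):
    """Maximizer of the dot product with scores over
    {0 <= s_i <= C, sum_i s_i = 1}: fill the best coordinates up to C."""
    s = [0.0] * len(scores)
    remaining = 1.0
    for i in sorted(range(len(scores)), key=lambda i: -scores[i]):
        s[i] = min(C, remaining)
        remaining -= s[i]
        if remaining <= 0.0:
            break
    return s


def solve_svc_dual(points, C, kernel=gaussian_kernel, iters=500):
    """Approximate the dual maximizer by Frank-Wolfe steps."""
    n = len(points)
    if C * n < 1.0:
        raise ValueError("infeasible: need C * N >= 1")
    K = [[kernel(p, r) for r in points] for p in points]
    alpha = capped_vertex([0.0] * n, C)  # any feasible starting point
    for t in range(iters):
        # Gradient of sum_i a_i K_ii - sum_ij a_i a_j K_ij.
        grad = [K[i][i] - 2.0 * sum(K[i][j] * alpha[j] for j in range(n))
                for i in range(n)]
        target = capped_vertex(grad, C)
        step = 2.0 / (t + 2.0)
        alpha = [(1.0 - step) * a + step * v for a, v in zip(alpha, target)]
    return alpha, K


def sq_dist_to_center(idx, alpha, K):
    """||phi(x_idx) - c||^2 for the center c = sum_j alpha_j phi(x_j)."""
    n = len(alpha)
    cross = sum(alpha[j] * K[idx][j] for j in range(n))
    norm_c = sum(alpha[i] * alpha[j] * K[i][j]
                 for i in range(n) for j in range(n))
    return K[idx][idx] - 2.0 * cross + norm_c
```

Points whose squared distance from the center exceeds $r^2$ lie outside the estimated support; the upper bound $C$ limits the weight any single point can receive, so an isolated point cannot pull the center toward itself.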

3.2 FEATURES FROM THE R-TREE 

The partition generated at a given level of the optimized R-tree is used as the 
input space of the novelty detection algorithm. For each node of the partition, the 
input parameters include the center c of its bounding box and the density (the 
ratio between the number n of elements covered by the node and the volume V of 
its bounding box). These features are not orthogonal. Therefore, we first apply
the PCA method to find the directions along which the variance is highest and
project the features onto the corresponding eigenvectors. The projected features are
passed to the SVC algorithm.
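
The feature extraction and projection described above can be sketched as follows. The sketch assumes that each node is given as the list of points it covers and that its bounding box has non-zero volume; the principal directions are found by power iteration with deflation, a simple stand-in for a full eigendecomposition. All function names are hypothetical.

```python
from math import prod, sqrt


def node_features(points):
    """Bounding-box center followed by the density n / V of one node."""
    lo = [min(c) for c in zip(*points)]
    hi = [max(c) for c in zip(*points)]
    center = [(l + h) / 2.0 for l, h in zip(lo, hi)]
    volume = prod(h - l for l, h in zip(lo, hi))  # assumed non-zero
    return center + [len(points) / volume]


def covariance(rows):
    """Centered rows and their sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centered, cov


def principal_axes(cov, n_axes, iters=500):
    """Leading eigenvectors of cov by power iteration with deflation.

    The fixed starting vector is assumed not to be orthogonal to the
    leading eigenvector.
    """
    d = len(cov)
    m = [row[:] for row in cov]
    axes = []
    for _ in range(n_axes):
        v = [1.0 / (k + 1) for k in range(d)]
        for _ in range(iters):
            w = [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * m[a][b] * v[b] for a in range(d) for b in range(d))
        axes.append(v)
        # Deflate: remove the component just found.
        m = [[m[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return axes


def project(rows, n_axes):
    """Coordinates of the centered rows along the leading principal axes."""
    centered, cov = covariance(rows)
    axes = principal_axes(cov, n_axes)
    return [[sum(c[k] * ax[k] for k in range(len(ax))) for ax in axes]
            for c in centered]
```

Applied to the rows returned by `node_features` for all nodes of a partition, `project` yields decorrelated inputs for the novelty detection step.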

3.3 GAMMA SOURCE DETECTION 

Point sources are mostly characterized by a flux that is stronger than that of the
surroundings and concentrated in a small angular region. The area covered by a point
source also depends on the instrument point spread function. An optimized R-tree
index can be built on a dataset including photons gathered in a certain range of
time (for the analysis we are using a minimum interval of 6 days). To find static
or strongly variable sources (e.g. gamma-ray bursts) we use only a two-dimensional
index on the RA and DEC values. The algorithm estimates the support of the
diffuse background. The outliers returned by the SVC algorithm are filtered to
discard the nodes with the lowest density; the remaining outliers are considered
point sources.
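
The filtering step can be sketched as follows. The text does not state how the low-density outliers are identified, so this sketch assumes a threshold equal to the median density of the non-outlier nodes; the dictionary layout, the label strings and the function name are likewise hypothetical.

```python
from statistics import median


def select_point_sources(nodes):
    """Keep the outlier nodes that are denser than the background.

    Each node is a dict with the keys 'density' and 'label', where the
    label is 'support', 'support_vector' or 'outlier' (assumed layout).
    """
    background = [n["density"] for n in nodes if n["label"] != "outlier"]
    threshold = median(background)  # assumed threshold choice
    return [n for n in nodes
            if n["label"] == "outlier" and n["density"] > threshold]
```

Outliers below the threshold correspond to nearly empty regions and are dropped, while those above it are kept as candidate point sources.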

Figure 2. The novelty detection algorithm applied to the anticenter


Figure 2 shows the application of our method to the anticenter region. Green
boxes represent the background (support), yellow boxes are support vectors,
and red boxes are the outliers. In particular, the three major sources of the
anticenter are highlighted as novelties.


4 Conclusions 

In this work we have implemented a multidimensional indexing method to efficiently
access and mine large multidimensional astrophysical data. The index adapts the
VAMSplit R-tree to large datasets. The partition generated by the optimized R-tree is
used to scale the SVC algorithm and find regions of interest where a more accurate
analysis can be performed.